Skip to content

fix(operator): targeted reindex after pg_resetwal via amcheck#68

Merged
passcod merged 1 commit into
mainfrom
operator-reindex-after-resetwal
Jun 9, 2026
Merged

fix(operator): targeted reindex after pg_resetwal via amcheck#68
passcod merged 1 commit into
mainfrom
operator-reindex-after-resetwal

Conversation

@passcod

@passcod passcod commented Jun 9, 2026

Copy link
Copy Markdown
Member

🤖

Summary

When the restore init script falls back to pg_resetwal -f (the operator's two-stage WAL-recovery-failure path), it touches a new /pgdata/needs-reindex-all marker. The main container's background-reindex hook picks up that marker and runs a targeted reindex via the amcheck contrib extension: scan every valid btree index, REINDEX only the ones that fail the structural check, plus a blind REINDEX of every non-btree index. All before the readiness probe lets traffic in.

Why not REINDEX DATABASE

A blind REINDEX DATABASE takes hours on the prod-sized DBs (known from prior experience — the prod indexes are many). Most pg_resetwal cases corrupt zero or a handful of indexes; rewriting everything is wasteful.

amcheck reads each btree index page and verifies its invariants — including the "unexpected zero page at block N" case that prompted this. Cost: ~index-size of disk reads. For a clean snapshot it finds nothing and reads only the indexes; for a corrupt one it identifies exactly which need rebuilding.

Caveats:

  • amcheck only covers btree. GIN / GiST / BRIN / hash indexes get a blind REINDEX (typically a small fraction of total index size; safer than skipping).
  • bt_index_check (light variant) takes only AccessShareLock and doesn't block writes.
  • bt_index_check isn't exhaustive — there are corruption shapes it misses. But it does catch zero-page issues, which is the reported failure mode.

Shape

  • postgres_single_or_resetwal (init script) now touches /pgdata/needs-reindex-all at both pg_resetwal invocation sites (detected-WAL-signature branch + last-resort branch).
  • Init script's stage selection treats either marker as "not yet ready".
  • Main container's background reindex hook:
    • needs-reindex-all present → enable amcheck, scan btrees, REINDEX corrupt btrees + all non-btrees. Clears both markers (the locale-only set is a strict subset).
    • else needs-reindex present → existing collation-only loop.
  • Readiness probe waits for both markers to be absent.
  • REINDEX CONCURRENTLY on PG ≥ 14, plain REINDEX on older.

Tests

Three new unit tests:

  • Every pg_resetwal -f "$PGDATA" invocation in the init script is paired with touch /pgdata/needs-reindex-all.
  • The runtime hook's needs-reindex-all branch enables amcheck, calls bt_index_check, collects non-btree indexes, runs targeted REINDEX, and explicitly does not use REINDEX DATABASE.
  • Readiness probe waits on both markers.

@passcod passcod force-pushed the operator-reindex-after-resetwal branch from dfcf7d2 to 34876f1 Compare June 9, 2026 06:57
@passcod passcod changed the title fix(operator): REINDEX DATABASE after pg_resetwal fallback fix(operator): targeted reindex after pg_resetwal via amcheck Jun 9, 2026
pg_resetwal bypasses WAL replay — it tells postgres to act as if the
data dir is consistent without replaying anything. Any index update
in flight at snapshot time can leave torn pages, which postgres later
surfaces as 'unexpected zero page at block N' when a query happens to
hit the corrupted index. Downstream users reported exactly this
against a primary key, with the hint 'Please REINDEX it.'

A blind REINDEX DATABASE takes hours on the prod-sized DBs (known
from prior experience). Instead, use postgres's amcheck contrib
extension to do a structural scan: read each valid btree index,
verify its invariants, queue only the failing ones for REINDEX. Plus
a blind REINDEX of every non-btree index (amcheck only covers btree,
non-btree are usually a tiny fraction of total index size).

Reuses the existing post-restore reindex infrastructure (the
/pgdata/needs-reindex marker, the background hook in the postgres
container startup, the readiness probe gate). Adds a second marker
/pgdata/needs-reindex-all set by every pg_resetwal -f invocation in
the operator's two-stage fallback path. The runtime hook handles
both flags, with the -all branch supplanting the narrow locale-only
branch when both are set.
@passcod passcod force-pushed the operator-reindex-after-resetwal branch from 34876f1 to fa961fd Compare June 9, 2026 07:00
@passcod passcod enabled auto-merge June 9, 2026 07:00
@passcod passcod merged commit d9c6e95 into main Jun 9, 2026
13 checks passed
@passcod passcod deleted the operator-reindex-after-resetwal branch June 9, 2026 07:10
passcod added a commit that referenced this pull request Jun 10, 2026
…SE CONCURRENTLY

The amcheck-driven smart pass introduced in #68 hits the same family
of postgres-internal pathology that wedges other vanilla DDL on the
prod dataset — bt_index_check itself burns 100% CPU forever on
specific indexes (observed via pg_stat_activity: state=active,
empty wait_event, query_start growing linearly with wall clock for
4+ minutes on a tiny system catalog index). The smart pass is
unusable on this data.

Two changes:

1. Drop bt_index_check; revert the needs-reindex-all branch to a
   blind REINDEX DATABASE per user DB, CONCURRENTLY on PG ≥ 12 so
   clients can keep using the old indexes during the rebuild and
   the atomic swap makes corruption disappear without downtime.
   REINDEX reads the heap and rebuilds the index from scratch — a
   different code path from amcheck (which reads corrupt index
   pages directly) — so it isn't subject to the same wedge. Slow
   on prod-sized DBs (hours) but makes progress.

2. Drop the readiness probe's gate on /pgdata/needs-reindex-all.
   With the reindex taking hours, gating readiness here trips the
   operator's 30-minute deployment_ready_timeout and the restore is
   marked Failed before postgres even has a chance to come up. The
   probe still gates on /pgdata/needs-reindex (locale-only,
   finishes in seconds) since that's small and proven.

Trade-off: clients hitting a not-yet-reindexed corrupt index in the
window between pod-Ready and reindex-complete get the explicit
"unexpected zero page" error from postgres. With CONCURRENTLY the
window is narrow (queries hit the old index until the atomic swap)
and clients can retry. Strictly better than the alternative
(permanently failed restore, replica stuck indefinitely on the
previous Active).

The pre-#68 behaviour for the locale-only path is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant